Update XML dump file namespace version #288

Merged
merged 1 commit into from
Jun 7, 2024

Conversation

xxyzz
Collaborator

@xxyzz xxyzz commented Jun 7, 2024

Dump files starting from 20240601 use version 0.11 of the XML namespace: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038392

Please note that with this change, older dump files can no longer be extracted. I have checked the en, zh and de editions, and none of these dump files contain empty pages.

@xxyzz
Collaborator Author

xxyzz commented Jun 7, 2024

The bz2 file at https://github.com/tatuylonen/wiktextract/blob/master/tests/test-pages-articles.xml.bz2 also needs to be updated after this PR is merged.

@kristian-clausal
Collaborator

If the only change we make here is updating the namespace string, it feels like we shouldn't break older dump files. Is it possible to dynamically determine whether the dump file uses 0.10 or 0.11 and pick between them in decompress_dump_file?
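
One way to do the detection asked about here is to read the namespace URI off the root element before processing the rest of the file. The following is only a sketch under stated assumptions: `detect_export_namespace` is a hypothetical helper, not a function from this project; it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies; and the namespace URI in the sample is assumed to have the form `http://www.mediawiki.org/xml/export-0.11/`.

```python
import io
import xml.etree.ElementTree as ET


def detect_export_namespace(stream):
    """Return the namespace URI of the root element, or "" if it has none.

    Hypothetical helper: only the first "start" event is consumed, so the
    rest of the (possibly very large) dump is not parsed here.
    """
    for _event, elem in ET.iterparse(stream, events=("start",)):
        tag = elem.tag  # namespaced tags look like "{uri}mediawiki"
        if tag.startswith("{"):
            return tag[1:tag.index("}")]
        return ""
    return ""


# Assumed sample document; a real dump would be a decompressed .bz2 stream.
sample = io.BytesIO(
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    b"<page/></mediawiki>"
)
namespace = detect_export_namespace(sample)
```

The returned URI could then be used to build namespaced tag names for either dump version. The stream has to be reopened or rewound before the full parse.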

@xxyzz
Collaborator Author

xxyzz commented Jun 7, 2024

I'll check how to handle multiple XML namespaces in lxml's functions.

Use `*` wildcards to remove the namespace limitation.

Dump files starting from 20240601 use version 0.11:
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038392
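
The wildcard approach described in the commit message can be sketched as follows. This is an illustration, not the merged code: `iter_page_titles` and the sample documents are hypothetical, the namespace URIs are assumed to have the form `http://www.mediawiki.org/xml/export-0.N/`, and the sketch uses the standard library's `xml.etree.ElementTree` (Python 3.8+), which accepts `{*}` wildcards in its path functions, instead of lxml.

```python
import xml.etree.ElementTree as ET

# Assumed namespace pattern; only the version number differs between dumps.
SAMPLE = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-{version}/">'
    "<page><title>{title}</title></page>"
    "</mediawiki>"
)


def iter_page_titles(xml_text):
    """Yield page titles regardless of which export namespace the dump uses."""
    root = ET.fromstring(xml_text)
    # "{*}page" matches a <page> element in any namespace.
    for page in root.iterfind("{*}page"):
        yield page.findtext("{*}title")


old_dump = SAMPLE.format(version="0.10", title="apple")
new_dump = SAMPLE.format(version="0.11", title="banana")
old_titles = list(iter_page_titles(old_dump))
new_titles = list(iter_page_titles(new_dump))
```

Because the wildcard matches any namespace, the same code reads both 0.10 and 0.11 dumps. The trade-off is that an unexpected schema version is no longer rejected by the tag match itself.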
@xxyzz
Collaborator Author

xxyzz commented Jun 7, 2024

The code works fine on the new 20240601 zh edition dump file and is ready to be merged.

@kristian-clausal kristian-clausal merged commit 6811128 into tatuylonen:main Jun 7, 2024
5 checks passed
@kristian-clausal
Collaborator

Thank you! If the dump file works on your side, I'll switch back to using -latest for kaikki.org.

@xxyzz
Collaborator Author

xxyzz commented Jun 7, 2024

I didn't extract all pages; I only checked whether there are any empty pages, and I think the Wikimedia developers have fixed the empty page bug.

@xxyzz xxyzz deleted the new_dump_xml_ns branch June 7, 2024 06:31
@xxyzz
Collaborator Author

xxyzz commented Jun 7, 2024

I notice that all 20240601 dump files are larger than the 0501 files: en 1.1G -> 1.3G, fr 588.7M -> 669.7M. These are compressed .bz2 files, so the extracted files will be even larger. I hope the server has enough disk space...

@kristian-clausal
Collaborator

The 0501 files were the corrupted ones, so we're returning to the state that was in April, so it should (the most dangerous word) be fine.

@xxyzz
Collaborator Author

xxyzz commented Jun 7, 2024

The 0520 files are the corrupted ones and were removed from dumps.wikimedia.org; the 0501 files are fine.


2 participants